Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (3-seed mean no-TTT: 1.1225)#1170

Open
Christopher-Lee-McClendon wants to merge 4 commits into openai:main from Christopher-Lee-McClendon:submission/11L-nativeflow-legal-ttt

Conversation

@Christopher-Lee-McClendon Christopher-Lee-McClendon commented Mar 31, 2026

Summary

Non-record submission exploring NativeFlowMatcher (NFM) — a 393K-parameter OT-CFM (Optimal Transport Conditional Flow Matching) velocity network that applies gated hidden-state correction to transformer hidden states, jointly trained with the AR objective. The Flow Matching module is trained as distribution transport, but used at inference as a small residual correction.
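The inference-time behavior described above (a gated Euler step applied as a residual correction to hidden states) can be sketched roughly as follows. This is a minimal NumPy illustration with randomly initialized stand-in weights, not the submission's code; the 512-dim hidden size and 256-dim velocity net follow the architecture section below, and every variable name is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 512, 256  # transformer hidden size, velocity-net hidden size

# Hypothetical velocity-network parameters (jointly trained with the AR loss).
W_in = rng.normal(0, 0.02, (D, H))
W_out = rng.normal(0, 0.02, (H, D))
w_gate = np.zeros(D)  # per-channel learned gate parameters

def sinusoidal_time_embed(t, dim=H):
    """Standard sinusoidal embedding of the flow time t in [0, 1]."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def nfm_correct(h, t=1.0):
    """Gated Euler step at t=1: h' = h + sigmoid(gate) * v(h, t)."""
    v = np.tanh(h @ W_in + sinusoidal_time_embed(t)) @ W_out  # velocity field
    gate = 1.0 / (1.0 + np.exp(-w_gate))                      # per-channel gate
    return h + gate * v

h = rng.normal(size=(4, D))  # a batch of hidden states
h2 = nfm_correct(h)
print(h2.shape)              # (4, 512)
```

The point of the gating is that the flow-matching module, trained as distribution transport, only ever nudges the hidden state rather than replacing it.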

Results

Three-seed reproducibility (training-time sliding window, no TTT):

| Seed | SLURM Job | Training val_bpb | Sliding BPB (no TTT) | Artifact Bytes |
| --- | --- | --- | --- | --- |
| 42 | 55342820 | 1.1380 | 1.12312 | 15,745,776 |
| 1337 | 55398556 | 1.1385 | 1.12367 | 15,736,933 |
| 2025 | 55398557 | 1.1359 | 1.12077 | 15,745,950 |
| Mean ± Std | | 1.1375 ± 0.0014 | 1.12252 ± 0.00151 | |

Primary (seed=42, with legal TTT):

| Evaluation | val_bpb |
| --- | --- |
| Sliding window (stride=64), no TTT | 1.12312 |
| Sliding window (stride=64), legal TTT | 1.11991 |

Legal TTT gain: −0.00321 BPB

Legal TTT evaluation for seeds 1337 and 2025 is pending (SLURM jobs 55411651–55411654).
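For readers unfamiliar with the evaluation protocol: stride-64 sliding-window scoring means every token is scored exactly once, but always with a long left context, because windows overlap and only the trailing stride of each window is scored. A minimal illustration (not the repo's eval code; a toy window of 256 is used, and "bpb" here treats tokens as bytes for simplicity):

```python
import numpy as np

def sliding_window_bpb(tokens, log2_prob_fn, window=256, stride=64):
    """Bits per token with a sliding window: windows of length `window`
    advance by `stride`; each window scores only its last `stride` positions
    (the first window scores everything), so every token is scored exactly
    once with at least `window - stride` tokens of left context."""
    total_bits, n_scored, start = 0.0, 0, 0
    while True:
        end = min(start + window, len(tokens))
        score_from = 0 if start == 0 else start + window - stride
        for t in range(score_from, end):
            total_bits -= log2_prob_fn(tokens[start:t], tokens[t])
            n_scored += 1
        if end >= len(tokens):
            break
        start += stride
    return total_bits / n_scored

# Sanity check with a uniform model over 256 byte values: exactly 8 bits/byte.
tokens = list(np.random.default_rng(0).integers(0, 256, size=1000))
bpb = sliding_window_bpb(tokens, lambda ctx, tgt: -np.log2(256))
print(bpb)  # 8.0
```

A smaller stride trades compute for more context per scored token, which is why stride=64 scores are lower than single-pass evaluation.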

Architecture

  • 11L/512D/GQA(8H/4KV), 3×MLP, 27.5M params total
  • NativeFlowMatcher: 256-dim hidden velocity network with sinusoidal time conditioning, gated Euler step at t=1
  • XSA on all 11 layers, BigramHash(4096,128), LeakyReLU(0.5)², value residual, gated attention
  • Mixed int6/int5 quantization + zstd-16 compression
  • Artifact: 15,745,776 bytes (254K headroom under 16MB cap)
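The low-bit artifact pipeline in the last two bullets can be illustrated with a minimal sketch of symmetric per-tensor quantization followed by entropy coding. This is an assumption-laden toy, not the submission's packing code: zlib stands in for zstd-16 (zstd is not in the Python stdlib used here), and per-channel scales, int5 groups, and bit-packing are omitted:

```python
import zlib  # stand-in for zstd level 16
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for int6, 15 for int5
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(512, 512)).astype(np.float32)

q, scale = quantize_symmetric(w, bits=6)
w_hat = q.astype(np.float32) * scale               # dequantized weights

raw = q.tobytes()
packed = zlib.compress(raw, 9)                     # entropy-code the 6-bit ints
print(len(packed) / len(raw))                      # < 1: low-bit ints compress
```

Since int6 values occupy only 63 of 256 byte codes (and are roughly Gaussian), the compressor recovers most of the unused bits, which is what makes a 27.5M-parameter model fit in a 16MB artifact.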

Training

  • 7,000 steps on 1×A100 PCIe 40GB, ~3.86 hours per seed
  • Muon + Adam optimizer, 2048 sequence length
  • Three seeds completed: 42, 1337, 2025

Ablation Studies

2×2 Matrix: NFM × TTT (isolating NFM contribution):

| Configuration | Params | No TTT (BPB) | Legal TTT (BPB) | Δ (TTT effect) |
| --- | --- | --- | --- | --- |
| Base (no NFM) | 27,137,223 | 1.12087 | pending | pending |
| NFM (hd=256, lw=0.1) | 27,530,952 | 1.12312 | 1.11991 | −0.00321 |
| Δ (NFM effect) | +393,729 | +0.00225 | pending | |

Base retraining is running. Loss weight sweep (lw=0.01, 0.05, 0.20) and hidden dim sweep (hd=128, 512) are queued.

Supplementary: E2E TTT + FlowRefiner 7k eval completed: legal TTT BPB = 1.12418.

Limitations

  • Three-seed reproducibility achieved (no-TTT): Mean sliding BPB = 1.12252 ± 0.00151. Legal TTT eval pending for seeds 1337, 2025.
  • Non-record — This submission documents the NFM idea and its interaction with legal TTT. It is unclear whether NFM justifies its extra compute cost (roughly 10 min of training and 10 min of eval). The number of training steps was chosen to match comparable base models without NFM.
  • NFM adds +0.00225 BPB vs matched base (no NFM) at 7k steps — the extra 393K params do not improve val_bpb. The idea may be more relevant at longer training schedules or combined with other techniques.

Credits

Base architecture (PR #549, @abaybektursun), Muon (baseline), BigramHash/SmearGate (PR #65, @aquariouserworkman), XSA (PR #187/#265, @Idan3011/@unnir), mixed quant (PR #76), sliding window (PR #50, @mattqlf), legal TTT (PR #77, @samacqua; PR #461, @Christopher-Lee-McClendon), VE/PartialRoPE/LN Scale (PR #315/#374, @jfprincz/@unnir), gated attention/value residual (PR #940), EMA (PR #65, @aquariouserworkman)

Checklist

  • Single training script (train_gpt.py) — self-contained
  • No n-gram cache
  • Legal TTT: score-first, no training on unscored tokens
  • 16MB artifact budget: 15,745,776 bytes
  • README with architecture details, results, provenance
  • submission.json with metadata
  • train.log with training trajectory
  • Three-seed reproducibility (seeds 42, 1337, 2025)
  • Supplementary eval logs and SLURM scripts for all seeds
  • Ablation studies (NFM × TTT matrix, sweep jobs submitted)

- NativeFlowMatcher: 393K-param OT-CFM velocity network with gated hidden-state correction
- Legal score-first TTT: SGD lr=0.002, 10 epochs, freeze_blocks=2
- val_bpb: 1.11991 (sliding window stride=64, legal TTT)
- val_bpb: 1.12312 (sliding window stride=64, no TTT)
- Artifact: 15,745,776 bytes (254K headroom)
- Single-seed (42) exploratory submission
- Supplementary: eval logs, SLURM scripts, comparison data
- 2×2 matrix: NFM × TTT with base no-TTT baseline (1.12087)
- Loss weight sweep: 0.01, 0.05, 0.1, 0.2
- Hidden dim sweep: 128, 256, 512
- 13 SLURM jobs submitted (6 train + 7 eval)
- Results pending, will update when jobs complete
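The score-first TTT constraint listed above (score a span before any gradient update touches it) can be illustrated with a toy model. This is an illustrative sketch with a learned unigram model, not the submission's TTT code; the chunk size, learning rate, and epoch count here are arbitrary:

```python
import numpy as np

def score_first_ttt(tokens, vocab=16, chunk=8, lr=0.5, epochs=2):
    """Legal TTT: for each chunk, (1) score it with the model as it stood
    before seeing the chunk, (2) only then take gradient steps on it. No
    token is ever trained on before it has been scored."""
    logits = np.zeros(vocab)                  # toy unigram model
    total_bits = 0.0
    for start in range(0, len(tokens), chunk):
        x = tokens[start:start + chunk]
        # --- score first ---
        p = np.exp(logits - logits.max()); p /= p.sum()
        total_bits -= np.log2(p[x]).sum()
        # --- then adapt on the already-scored chunk ---
        for _ in range(epochs):
            p = np.exp(logits - logits.max()); p /= p.sum()
            grad = len(x) * p - np.bincount(x, minlength=vocab)  # d(NLL)/dlogits
            logits -= lr * grad
    return total_bits / len(tokens)

rng = np.random.default_rng(0)
tokens = rng.integers(0, 4, size=256)  # skewed stream: only 4 of 16 symbols
print(score_first_ttt(tokens))         # below the 4-bit uniform baseline
```

Adaptation helps (the model learns the stream's skew as it goes) while every probability remains a function only of earlier tokens, which is the compliance condition discussed further down in this thread.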
@Christopher-Lee-McClendon
Author

Ablation Studies Submitted

13 SLURM jobs have been submitted to run comprehensive ablation studies for this NFM submission:

2×2 Matrix: NFM × Legal TTT

Isolating the individual contributions of NFM and legal TTT at matched 7k steps.

| Configuration | Params | No TTT (BPB) | Legal TTT (BPB) |
| --- | --- | --- | --- |
| Base (no NFM) | 27,137,223 | 1.12087 ✅ | pending (→55398695) |
| NFM (hd=256, lw=0.1) | 27,530,952 | 1.12312 ✅ | 1.11991 |

NFM Hyperparameter Sweeps

Loss weight sweep (hidden_dim=256, seed=42):

  • lw=0.01 → jobs 55398696→55398699
  • lw=0.05 → jobs 55398697→55398700
  • lw=0.10 (default) → 1.12312 ✅
  • lw=0.20 → jobs 55398698→55398701

Hidden dim sweep (loss_weight=0.1, seed=42):

  • hd=128 → jobs 55398702→55398704
  • hd=256 (default) → 1.12312 ✅
  • hd=512 → jobs 55398703→55398705

Also pending

  • 3-seed reproducibility runs (seeds 1337, 2025): jobs 55398556–55398561
  • E2E TTT+Flow 7k reeval with 5h wallclock: job 55398555

Results will be updated in README as jobs complete.

- Training completed for seeds 42, 1337, 2025 (all 7k steps)
- 3-seed mean sliding BPB (no TTT): 1.12252 ± 0.00151
- Seed 42: 1.12312, Seed 1337: 1.12367, Seed 2025: 1.12077
- Legal TTT eval jobs submitted (SLURM 55411651-55411654)
- Added completed E2E TTT+Flow eval log (SLURM 55398555, BPB=1.12418)
- Added training logs and SLURM scripts for all seed runs
- Updated README with 3-seed results table and training trajectories
- Updated submission.json with per-seed metrics and job IDs
@Christopher-Lee-McClendon Christopher-Lee-McClendon changed the title Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (single seed) Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (3-seed mean no-TTT: 1.1225) Apr 1, 2026
- Three-seed legal TTT: mean 1.11928 ± 0.00146 (seeds 42, 1337, 2025)
- 2×2 NFM×TTT matrix complete: NFM hurts by +0.002 (no-TTT) / +0.001 (TTT)
- Loss weight sweep: lw=0.05 best but still +0.002 worse than base
- Hidden dim sweep: hd=512 best but still +0.001 worse than base
- Updated limitations section to reflect negative result conclusion
@MatoTeziTanka

Community Review — Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (3-seed mean no-TTT: 1.1225)

BPB: 1.1199 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (head SHA 039f4e891256, file records/track_non_record_16mb/2026-03-31_11L_NativeFlowMatcher_LegalTTT/train_gpt.py):

The n-gram lookup key at line 1475 is constructed by XOR-ing the target token into the hash:

line 1475: full_key = <hash> ^ (tgt_np * ng_primes[...]) & mask

This matches the full_key = ((ctx_hash ^ (target * primes[k])) & mask) construction that @valerio-oai ruled disallowed on PR #779 (comment 4145781641, 2026-03-27). Per the mechanism explanation, hashing the target token into the lookup key only reweights the correct token — in the hash-collision limit this drives P(correct) → 1 regardless of the data, which inflates the reported BPB without producing real compression.

Per Issue #1017 condition 1, p_t may depend only on the artifact and x_1...x_{t-1}. Because the lookup key at line 1475 is a function of the target token, the count read at scoring position t depends on x_t itself — which is the core violation the #779 ruling targets.
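The inflation mechanism described in the ruling can be reproduced in a few lines. This is a toy reconstruction of the disallowed pattern, not the PR's actual code; the table size, prime, and boost probability are all illustrative. Because the true target enters the lookup key, once hash collisions fill the small table, the "cache hit" lands on the correct token at nearly every position, even on incompressible data:

```python
import numpy as np

MASK, PRIME = (1 << 8) - 1, 31337           # tiny table -> heavy collisions
VOCAB = 1024

rng = np.random.default_rng(0)
tokens = rng.integers(0, VOCAB, size=4000)  # random data: 10 bits/token

counts = np.zeros(MASK + 1)
ctx, bits = 0, 0.0
for t in range(1, len(tokens)):
    ctx = (ctx * 1000003 + int(tokens[t - 1])) & MASK   # rolling context hash
    # ILLEGAL: the true target x_t enters the key used to score position t.
    full_key = (ctx ^ (int(tokens[t]) * PRIME)) & MASK
    if counts[full_key] > 0:   # almost always true once buckets fill
        p_correct = 0.99       # the boost is addressed *via the target itself*
    else:
        p_correct = 1.0 / VOCAB
    bits -= np.log2(p_correct)
    counts[full_key] += 1

print(bits / (len(tokens) - 1))  # far below 10 bits despite random data
```

A context-only key cannot do this: probing key(ctx, c) for every candidate c yields a legitimate (if noisy) n-gram distribution, which is the legal path suggested on #779.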

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488). The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10-1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=115032 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
